library(datasets)
First lets bring in the Pokemon dataset.
Pokemon = read.csv(file = 'data/Pokemon.csv')
head(Pokemon)
## X. Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122
## 5 4 Charmander Fire 309 39 52 43 60
## 6 5 Charmeleon Fire 405 58 64 58 80
## Sp..Def Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 120 80 1 False
## 5 50 65 1 False
## 6 65 80 1 False
What is the most common type for pokemon?
Since there are secondary typing for Pokemon, we will have to combine the observations from the two variables.
But first lets see what the data shows for Pokemon that only have one type.
# this is the Type.2 for charmander
# a Pokemon that has a singular "Fire" type
Pokemon$Type.2[5]
## [1] ""
The data displays a “” or empty for a Pokemon that has to Type.2. This will muddy our data if we’re looking for the most common type, since technically “” is a valid observation as far as the code is concerned.
These are the tables for Type.1 and Type.2 variables.
table(Pokemon$Type.1)
##
## Bug Dark Dragon Electric Fairy Fighting Fire Flying
## 69 31 32 44 17 27 52 4
## Ghost Grass Ground Ice Normal Poison Psychic Rock
## 32 70 32 24 98 28 57 44
## Steel Water
## 27 112
table(Pokemon$Type.2)
##
## Bug Dark Dragon Electric Fairy Fighting Fire
## 386 3 20 18 6 23 26 12
## Flying Ghost Grass Ground Ice Normal Poison Psychic
## 97 14 25 35 14 4 34 33
## Rock Steel Water
## 14 22 14
# how many types are in Type.1
length(table(Pokemon$Type.1))
## [1] 18
# how many types are in Type.2
length(table(Pokemon$Type.2))
## [1] 19
As you can see there is an extra type in
table(Pokemon$Type.2) that is “” with the number 386 to
represent all of the observations/Pokemon with “” for Type.2.
This is the combined table of Type.1 and Type.2, minus the empty
“” type Now we need to be careful, because this does not represent every
Pokemon. As you can see when we look at the sum of the table, it shows
1214 when we only have 800 observations/Pokemon.
combinedTypeTable = (table(Pokemon$Type.1) + table(Pokemon$Type.2)[-1])
paste("The sum is: ", sum(combinedTypeTable))
## [1] "The sum is: 1214"
Finally lets see which type is the most common for pokemon, either as their Type.1 OR their Type.2.
# finds the max(table), and then finds the index of the max value, then returns both the number and its name, in our case the type
combinedTypeTable[which(combinedTypeTable == max(combinedTypeTable))]
## Water
## 126
As you can see, the most common typing for a Pokemon is Water.
But as always a visualization will help in contextualizing this piece of information.
combinedTypeTable = sort(combinedTypeTable, decreasing=TRUE)
barplot(combinedTypeTable[1:5], main = "TOP 5 MOST COMMON TYPES IN POKEMON", ylab="Frequency", col=c("lightblue"))
What generation has the greatest number of each “type”(Type.1 and Type.2 combined) of Pokemon
First we need to know how many different generations are present in our data.
table(Pokemon$Generation)
##
## 1 2 3 4 5 6
## 166 106 160 121 165 82
There are 6 different generations of Pokemon listed in the data, with a varying number of Pokemon for each generation.
Now we need to parse the data so that we can see how many of each type is in each generation.
# table for the number of each Type.1 type for all entries where the Generation is 1
table(Pokemon$Type.1[which(Pokemon$Generation == 1)])
##
## Bug Dragon Electric Fairy Fighting Fire Ghost Grass
## 14 3 9 2 7 14 4 13
## Ground Ice Normal Poison Psychic Rock Water
## 8 2 24 14 11 10 31
# table for the number of each Type.2 type for all entries where the Generation is 1
table(Pokemon$Type.2[which(Pokemon$Generation == 1)])
##
## Dark Dragon Fairy Fighting Flying Grass Ground
## 88 1 1 3 2 23 2 6
## Ice Poison Psychic Rock Steel Water
## 3 22 7 2 2 4
length(table(Pokemon$Type.1[which(Pokemon$Generation == 1)]))
## [1] 15
length(table(Pokemon$Type.2[which(Pokemon$Generation == 1)]))
## [1] 14
But we run in to a problem due to the way that Pokemon types are set up. A Pokemon has one or two types, and the Type.2 is valued the same as Type.1. This means that we need to count both when looking at how many types of each Pokemon are in each generation. For example the first Pokemon in the data, Bulbasaur, would count as both a “Grass” and “Poison” type, with both holding equal weight. This causes discrepancies like we see above, where some types are missing from the Type.2 table and the Type.1 table . We will have to modify the table ourselves to make this work.
Since we will most likely have to do this for every generation, lets make a function that fills in the missing “types” and returns one vector with the combined values for both Type.1 and Type.2
# this function will help us combine the two Type tables int o one table with all unique types from both Type.1 and Type.2, as well as add up duplicates
combineTypeTblByGen = function(generation){
# merging the table of Type.1 and Type.2 for the generation
mergedTypeTable = merge(table(Pokemon$Type.1[which(Pokemon$Generation == generation)]), table(Pokemon$Type.2[which(Pokemon$Generation == generation)])[-1], all= TRUE)
# vector version of the merged tables above
typeVec = mergedTypeTable[[2]]
names(typeVec) = mergedTypeTable[[1]]
# for loop that adds up the repeat values for the types, and then removes the duplicate name+value from the vector
len = length(typeVec) - 1
for(i in 1:len){
if(i >= length(typeVec)){break}
else if(names(typeVec[i]) == names(typeVec[i + 1])){
typeVec[i+1] = typeVec[i] + typeVec[i+1]
#print(typeVec[i])
typeVec = typeVec[-c(i)]
}
}
return(typeVec)
}
barplot(combineTypeTblByGen(1), las=2, main="Frequency of Types for Generation 1")
The plot above combines frequency of “types” in both Type1 and Type2 for generation 1, and because we have the function combineTypeTblByGen() we are able to replicate this for every other generation as well.
Since we have the combined tables for every generation, we can move on to comparing the generations by Type.
This next function will allow us to see how many of one specific type is in every generation. For example it will allow us to see the number of “Bug” types there are in generation 1 through 6.
# function takes in a character argument that represents a Pokemon Type and returns a vector of length 6, with each index representing the number of that type of pokemon are in that generation number
compareTypeAcrossGen = function(type){
oneTypeAllGensVec = c()
for(i in 1:6){
#print(combineTypeTblByGen(i)[type])
oneTypeAllGensVec[i] = combineTypeTblByGen(i)[type]
}
names(oneTypeAllGensVec) = c("Gen 1", "Gen 2", "Gen 3", "Gen 4", "Gen 5", "Gen 6")
return(oneTypeAllGensVec)
}
compareTypeAcrossGen("Bug")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 14 12 14 11 18 3
compareTypeAcrossGen("Dark")
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 1 8 13 7 16 3
barplot(compareTypeAcrossGen("Bug"), main="Frequency of Bug Types in Each Generation(Gen)", las=2, col = "lightgreen")
We have the comparisons across generations for individual types, but that doesn’t accomplish anything unless we want to create 18 different barplots. Instead lets use the plotly package to create an interactive barplot that shows the frequency of each type across the generations to get a broader understanding of the dataset.
typeByGenTraces[]
## [[1]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 14 12 14 11 18 3
##
## [[2]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 4 2 15 4 12 9
##
## [[3]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 9 9 5 12 12 3
##
## [[4]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 5 8 8 1 3 14
##
## [[5]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 9 2 9 10 17 4
##
## [[6]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 14 11 9 6 16 8
##
## [[7]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 4 1 8 9 9 15
##
## [[8]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 15 10 18 17 20 15
##
## [[9]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 14 11 16 12 12 2
##
## [[10]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 5 5 7 8 9 2
##
## [[11]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 24 15 18 18 19 4
##
## [[12]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 36 4 5 8 7 2
##
## [[13]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 18 10 28 10 16 8
##
## [[14]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 12 8 12 7 10 9
##
## [[15]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 35 18 31 15 18 9
##
## [[16]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 1 8 13 7 16 3
##
## [[17]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 23 19 14 16 21 8
##
## [[18]]
## Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
## 2 3 12 12 12 5
q2FinalGraph
library(DBI)
library(RSQLite)
## Warning: package 'RSQLite' was built under R version 4.4.3
drv = dbDriver("SQLite")
pokemonDatabase = dbConnect(drv, dbname = "./data/veekun-pokedex.sqlite")
dbExecute(pokemonDatabase, "PRAGMA foreign_keys = on")
## [1] 0
dbListTables(pokemonDatabase)
## [1] "abilities" "ability_changelog"
## [3] "ability_changelog_prose" "ability_flavor_text"
## [5] "ability_names" "ability_prose"
## [7] "berries" "berry_firmness"
## [9] "berry_firmness_names" "berry_flavors"
## [11] "characteristic_text" "characteristics"
## [13] "conquest_episode_names" "conquest_episode_warriors"
## [15] "conquest_episodes" "conquest_kingdom_names"
## [17] "conquest_kingdoms" "conquest_max_links"
## [19] "conquest_move_data" "conquest_move_displacement_prose"
## [21] "conquest_move_displacements" "conquest_move_effect_prose"
## [23] "conquest_move_effects" "conquest_move_range_prose"
## [25] "conquest_move_ranges" "conquest_pokemon_abilities"
## [27] "conquest_pokemon_evolution" "conquest_pokemon_moves"
## [29] "conquest_pokemon_stats" "conquest_stat_names"
## [31] "conquest_stats" "conquest_transformation_pokemon"
## [33] "conquest_transformation_warriors" "conquest_warrior_archetypes"
## [35] "conquest_warrior_names" "conquest_warrior_rank_stat_map"
## [37] "conquest_warrior_ranks" "conquest_warrior_skill_names"
## [39] "conquest_warrior_skills" "conquest_warrior_specialties"
## [41] "conquest_warrior_stat_names" "conquest_warrior_stats"
## [43] "conquest_warrior_transformation" "conquest_warriors"
## [45] "contest_combos" "contest_effect_prose"
## [47] "contest_effects" "contest_type_names"
## [49] "contest_types" "egg_group_prose"
## [51] "egg_groups" "encounter_condition_prose"
## [53] "encounter_condition_value_map" "encounter_condition_value_prose"
## [55] "encounter_condition_values" "encounter_conditions"
## [57] "encounter_method_prose" "encounter_methods"
## [59] "encounter_slots" "encounters"
## [61] "evolution_chains" "evolution_trigger_prose"
## [63] "evolution_triggers" "experience"
## [65] "genders" "generation_names"
## [67] "generations" "growth_rate_prose"
## [69] "growth_rates" "item_categories"
## [71] "item_category_prose" "item_flag_map"
## [73] "item_flag_prose" "item_flags"
## [75] "item_flavor_summaries" "item_flavor_text"
## [77] "item_fling_effect_prose" "item_fling_effects"
## [79] "item_game_indices" "item_names"
## [81] "item_pocket_names" "item_pockets"
## [83] "item_prose" "items"
## [85] "language_names" "languages"
## [87] "location_area_encounter_rates" "location_area_prose"
## [89] "location_areas" "location_game_indices"
## [91] "location_names" "locations"
## [93] "machines" "move_battle_style_prose"
## [95] "move_battle_styles" "move_changelog"
## [97] "move_damage_class_prose" "move_damage_classes"
## [99] "move_effect_changelog" "move_effect_changelog_prose"
## [101] "move_effect_prose" "move_effects"
## [103] "move_flag_map" "move_flag_prose"
## [105] "move_flags" "move_flavor_summaries"
## [107] "move_flavor_text" "move_meta"
## [109] "move_meta_ailment_names" "move_meta_ailments"
## [111] "move_meta_categories" "move_meta_category_prose"
## [113] "move_meta_stat_changes" "move_names"
## [115] "move_target_prose" "move_targets"
## [117] "moves" "nature_battle_style_preferences"
## [119] "nature_names" "nature_pokeathlon_stats"
## [121] "natures" "pal_park"
## [123] "pal_park_area_names" "pal_park_areas"
## [125] "pokeathlon_stat_names" "pokeathlon_stats"
## [127] "pokedex_prose" "pokedex_version_groups"
## [129] "pokedexes" "pokemon"
## [131] "pokemon_abilities" "pokemon_color_names"
## [133] "pokemon_colors" "pokemon_dex_numbers"
## [135] "pokemon_egg_groups" "pokemon_evolution"
## [137] "pokemon_form_generations" "pokemon_form_names"
## [139] "pokemon_form_pokeathlon_stats" "pokemon_forms"
## [141] "pokemon_game_indices" "pokemon_habitat_names"
## [143] "pokemon_habitats" "pokemon_items"
## [145] "pokemon_move_method_prose" "pokemon_move_methods"
## [147] "pokemon_moves" "pokemon_shape_prose"
## [149] "pokemon_shapes" "pokemon_species"
## [151] "pokemon_species_flavor_summaries" "pokemon_species_flavor_text"
## [153] "pokemon_species_names" "pokemon_species_prose"
## [155] "pokemon_stats" "pokemon_types"
## [157] "region_names" "regions"
## [159] "stat_names" "stats"
## [161] "super_contest_combos" "super_contest_effect_prose"
## [163] "super_contest_effects" "type_efficacy"
## [165] "type_game_indices" "type_names"
## [167] "types" "version_group_pokemon_move_methods"
## [169] "version_group_regions" "version_groups"
## [171] "version_names" "versions"
This database and tables needs to be “cleaned” or organized into the
information that I actually need –> practice organizing and putting
data togehter with joins
Questions to ask/analyze:
how has the stats fro certain types changed over the course of the
generartions?
is this change significant?
Pokedex app? image, stats, and reads pokedex entry? check normality for stats by type
hypothesis: flying types are significantly faster than ground, check?
When I think of flying type Pokemon, I imagine fast aerial predators,
so naturally you would think that their speed stat is significantly
higher than that of ground/rock types that are usually large boulders
and giants. But do the stats, get it?, actually back this up?
Lets Find Out!
Step 1: get a data frame with every flying type and their
speed stat and every ground type and their speed stat
Lets build this dataframe step by step.
Step 1a: get a dataframe with every Pokemon and its speed
stat
dbGetQuery(pokemonDatabase,
"PRAGMA FOREIGN_KEY_LIST(pokemon_stats)")
## id seq table from to on_update on_delete match
## 1 0 0 stats stat_id id NO ACTION NO ACTION NONE
## 2 1 0 pokemon pokemon_id id NO ACTION NO ACTION NONE
dbGetQuery(pokemonDatabase, "
SELECT pokemon_id, identifier, base_stat
FROM pokemon_stats INNER JOIN stats
ON pokemon_stats.stat_id = stats.id
WHERE identifier = 'speed'
LIMIT 20")
## pokemon_id identifier base_stat
## 1 1 speed 45
## 2 2 speed 60
## 3 3 speed 80
## 4 4 speed 65
## 5 5 speed 80
## 6 6 speed 100
## 7 7 speed 43
## 8 8 speed 58
## 9 9 speed 78
## 10 10 speed 45
## 11 11 speed 30
## 12 12 speed 70
## 13 13 speed 50
## 14 14 speed 35
## 15 15 speed 75
## 16 16 speed 56
## 17 17 speed 71
## 18 18 speed 101
## 19 19 speed 72
## 20 20 speed 97
Step 1b: dataframe with every pokemon’s type
# dbGetQuery(pokemonDatabase, "
# SELECT *
# FROM pokemon_types")
#
# dbGetQuery(pokemonDatabase, "
# PRAGMA FOREIGN_KEY_LIST(pokemon_types)")
#
# dbGetQuery(pokemonDatabase, "
# SELECT *
# FROM types")
dbGetQuery(pokemonDatabase, "
PRAGMA FOREIGN_KEY_LIST(pokemon_types)
")
## id seq table from to on_update on_delete match
## 1 0 0 types type_id id NO ACTION NO ACTION NONE
## 2 1 0 pokemon pokemon_id id NO ACTION NO ACTION NONE
dbGetQuery(pokemonDatabase, "
SELECT pokemon_id, slot, identifier as TYPE
FROM pokemon_types INNER JOIN types
ON pokemon_types.type_id = types.id
LIMIT 20")
## pokemon_id slot TYPE
## 1 1 1 grass
## 2 1 2 poison
## 3 2 1 grass
## 4 2 2 poison
## 5 3 1 grass
## 6 3 2 poison
## 7 4 1 fire
## 8 5 1 fire
## 9 6 1 fire
## 10 6 2 flying
## 11 7 1 water
## 12 8 1 water
## 13 9 1 water
## 14 10 1 bug
## 15 11 1 bug
## 16 12 1 bug
## 17 12 2 flying
## 18 13 1 bug
## 19 13 2 poison
## 20 14 1 bug
Step 1c: Now that we have two dataframes with every pokemon and its speed stat and another with every Pokemon and its type, we need a dataframe that combines these two.
# initial query with all required columns
dbGetQuery(pokemonDatabase, "
SELECT pokemon.id, pokemon.identifier, pokemon_types.pokemon_id, pokemon_types.type_id, types.identifier, pokemon_stats.stat_id, pokemon_stats.base_stat, stats.identifier
FROM pokemon_types
INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
INNER JOIN types ON pokemon_types.type_id = types.id
INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
INNER JOIN stats ON pokemon_stats.stat_id = stats.id
LIMIT 20
")
## id identifier pokemon_id type_id identifier stat_id base_stat
## 1 1 bulbasaur 1 12 grass 1 45
## 2 1 bulbasaur 1 12 grass 2 49
## 3 1 bulbasaur 1 12 grass 3 49
## 4 1 bulbasaur 1 12 grass 4 65
## 5 1 bulbasaur 1 12 grass 5 65
## 6 1 bulbasaur 1 12 grass 6 45
## 7 1 bulbasaur 1 4 poison 1 45
## 8 1 bulbasaur 1 4 poison 2 49
## 9 1 bulbasaur 1 4 poison 3 49
## 10 1 bulbasaur 1 4 poison 4 65
## 11 1 bulbasaur 1 4 poison 5 65
## 12 1 bulbasaur 1 4 poison 6 45
## 13 2 ivysaur 2 12 grass 1 60
## 14 2 ivysaur 2 12 grass 2 62
## 15 2 ivysaur 2 12 grass 3 63
## 16 2 ivysaur 2 12 grass 4 80
## 17 2 ivysaur 2 12 grass 5 80
## 18 2 ivysaur 2 12 grass 6 60
## 19 2 ivysaur 2 4 poison 1 60
## 20 2 ivysaur 2 4 poison 2 62
## identifier
## 1 hp
## 2 attack
## 3 defense
## 4 special-attack
## 5 special-defense
## 6 speed
## 7 hp
## 8 attack
## 9 defense
## 10 special-attack
## 11 special-defense
## 12 speed
## 13 hp
## 14 attack
## 15 defense
## 16 special-attack
## 17 special-defense
## 18 speed
## 19 hp
## 20 attack
Save the dataframe so that we can do statistics on it with R
# cleaned up query with all required info for flying and ground type pokemon
flynGrndSPD_df = dbGetQuery(pokemonDatabase, "
SELECT pokemon.identifier AS name, types.identifier AS type, stats.identifier AS stat_name, pokemon_stats.base_stat
FROM pokemon_types
INNER JOIN pokemon ON pokemon_types.pokemon_id = pokemon.id
INNER JOIN types ON pokemon_types.type_id = types.id
INNER JOIN pokemon_stats ON pokemon_stats.pokemon_id = pokemon.id
INNER JOIN stats ON pokemon_stats.stat_id = stats.id
WHERE stat_name LIKE 'speed' AND (type LIKE 'flying' OR type LIKE 'ground')
")
# data frame with flying types and their speed stats
flyingSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "flying"), ]
# data frame with ground types and their speed stats
groundSpeeds = flynGrndSPD_df[which(flynGrndSPD_df$type == "ground"), ]
flynGrndSPD_df
## name type stat_name base_stat
## 1 charizard flying speed 100
## 2 butterfree flying speed 70
## 3 pidgey flying speed 56
## 4 pidgeotto flying speed 71
## 5 pidgeot flying speed 101
## 6 spearow flying speed 70
## 7 fearow flying speed 100
## 8 sandshrew ground speed 40
## 9 sandslash ground speed 65
## 10 nidoqueen ground speed 76
## 11 nidoking ground speed 85
## 12 zubat flying speed 55
## 13 golbat flying speed 90
## 14 diglett ground speed 95
## 15 dugtrio ground speed 120
## 16 geodude ground speed 20
## 17 graveler ground speed 35
## 18 golem ground speed 45
## 19 farfetchd flying speed 60
## 20 doduo flying speed 75
## 21 dodrio flying speed 100
## 22 onix ground speed 70
## 23 cubone ground speed 35
## 24 marowak ground speed 45
## 25 rhyhorn ground speed 25
## 26 rhydon ground speed 40
## 27 scyther flying speed 105
## 28 gyarados flying speed 81
## 29 aerodactyl flying speed 130
## 30 articuno flying speed 85
## 31 zapdos flying speed 100
## 32 moltres flying speed 90
## 33 dragonite flying speed 80
## 34 hoothoot flying speed 50
## 35 noctowl flying speed 70
## 36 ledyba flying speed 55
## 37 ledian flying speed 85
## 38 crobat flying speed 130
## 39 togetic flying speed 40
## 40 natu flying speed 70
## 41 xatu flying speed 95
## 42 hoppip flying speed 50
## 43 skiploom flying speed 80
## 44 jumpluff flying speed 110
## 45 yanma flying speed 95
## 46 wooper ground speed 15
## 47 quagsire ground speed 35
## 48 murkrow flying speed 91
## 49 gligar ground speed 85
## 50 gligar flying speed 85
## 51 steelix ground speed 30
## 52 swinub ground speed 50
## 53 piloswine ground speed 50
## 54 delibird flying speed 75
## 55 mantine flying speed 70
## 56 skarmory flying speed 70
## 57 phanpy ground speed 40
## 58 donphan ground speed 50
## 59 larvitar ground speed 41
## 60 pupitar ground speed 51
## 61 lugia flying speed 110
## 62 ho-oh flying speed 90
## 63 marshtomp ground speed 50
## 64 swampert ground speed 60
## 65 beautifly flying speed 65
## 66 taillow flying speed 85
## 67 swellow flying speed 125
## 68 wingull flying speed 85
## 69 pelipper flying speed 65
## 70 masquerain flying speed 60
## 71 nincada ground speed 40
## 72 ninjask flying speed 160
## 73 numel ground speed 35
## 74 camerupt ground speed 40
## 75 trapinch ground speed 10
## 76 vibrava ground speed 70
## 77 flygon ground speed 100
## 78 swablu flying speed 50
## 79 altaria flying speed 80
## 80 barboach ground speed 60
## 81 whiscash ground speed 60
## 82 baltoy ground speed 55
## 83 claydol ground speed 75
## 84 tropius flying speed 51
## 85 salamence flying speed 100
## 86 groudon ground speed 90
## 87 rayquaza flying speed 95
## 88 torterra ground speed 56
## 89 starly flying speed 60
## 90 staravia flying speed 80
## 91 staraptor flying speed 100
## 92 mothim flying speed 66
## 93 combee flying speed 70
## 94 vespiquen flying speed 40
## 95 gastrodon ground speed 39
## 96 drifloon flying speed 70
## 97 drifblim flying speed 80
## 98 honchkrow flying speed 71
## 99 chatot flying speed 91
## 100 gible ground speed 42
## 101 gabite ground speed 82
## 102 garchomp ground speed 102
## 103 hippopotas ground speed 32
## 104 hippowdon ground speed 47
## 105 mantyke flying speed 50
## 106 rhyperior ground speed 40
## 107 togekiss flying speed 80
## 108 yanmega flying speed 95
## 109 gliscor ground speed 95
## 110 gliscor flying speed 95
## 111 mamoswine ground speed 80
## 112 pidove flying speed 43
## 113 tranquill flying speed 65
## 114 unfezant flying speed 93
## 115 woobat flying speed 72
## 116 swoobat flying speed 114
## 117 drilbur ground speed 68
## 118 excadrill ground speed 88
## 119 palpitoad ground speed 69
## 120 seismitoad ground speed 74
## 121 sandile ground speed 65
## 122 krokorok ground speed 74
## 123 krookodile ground speed 92
## 124 sigilyph flying speed 97
## 125 archen flying speed 70
## 126 archeops flying speed 110
## 127 ducklett flying speed 55
## 128 swanna flying speed 98
## 129 emolga flying speed 103
## 130 stunfisk ground speed 32
## 131 golett ground speed 35
## 132 golurk ground speed 55
## 133 rufflet flying speed 60
## 134 braviary flying speed 80
## 135 vullaby flying speed 60
## 136 mandibuzz flying speed 80
## 137 tornadus-incarnate flying speed 111
## 138 thundurus-incarnate flying speed 111
## 139 landorus-incarnate ground speed 101
## 140 landorus-incarnate flying speed 101
## 141 diggersby ground speed 78
## 142 fletchling flying speed 62
## 143 fletchinder flying speed 84
## 144 talonflame flying speed 126
## 145 vivillon flying speed 89
## 146 hawlucha flying speed 118
## 147 noibat flying speed 55
## 148 noivern flying speed 123
## 149 yveltal flying speed 99
## 150 zygarde ground speed 95
## 151 wormadam-sandy ground speed 36
## 152 shaymin-sky flying speed 127
## 153 rotom-fan flying speed 86
## 154 tornadus-therian flying speed 121
## 155 thundurus-therian flying speed 101
## 156 landorus-therian ground speed 91
## 157 landorus-therian flying speed 91
## 158 charizard-mega-y flying speed 100
## 159 pinsir-mega flying speed 105
## 160 aerodactyl-mega flying speed 150
## 161 garchomp-mega ground speed 92
## 162 swampert-mega ground speed 70
## 163 steelix-mega ground speed 30
## 164 pidgeot-mega flying speed 121
## 165 groudon-primal ground speed 90
## 166 rayquaza-mega flying speed 115
## 167 camerupt-mega ground speed 20
## 168 salamence-mega flying speed 120
In order to figure out what hypothesis test to choose, lets figure out if our data follows a normal distribution.
hist(flyingSpeeds$base_stat, main = "Flying Type Speed Distribution", xlab = "Speed stat")
hist(groundSpeeds$base_stat, main = "Ground Type Speed Distribution", xlab = "Speed stat")
qqnorm(flyingSpeeds$base_stat, main = "Normal Q-Q Plot for Flying Types")
qqline(flyingSpeeds$base_stat, col = "red")
qqnorm(groundSpeeds$base_stat, main = "Normal Q-Q Plot for Ground Types")
qqline(groundSpeeds$base_stat, col = "red")
While the data doesnt appear to be normal, it helps to do a formal
test to make sure that we are making an accurate interpretation of our
data that isnt solely reliant on visuals.
In order to do that, we’ll conduct a Shapiro Wilk Test
on a random sample from both groups.
First we need to take random samples from both groups with sample size n = 30.
flySpeedSample = sample(groundSpeeds$base_stat, 30, replace = TRUE)
grndSpeedSample = sample(flyingSpeeds$base_stat, 30, replace = TRUE)
Then we can use R’s built in function for the Shapiro test for normality.
shapiro.test(flySpeedSample)
##
## Shapiro-Wilk normality test
##
## data: flySpeedSample
## W = 0.90575, p-value = 0.01164
shapiro.test(grndSpeedSample)
##
## Shapiro-Wilk normality test
##
## data: grndSpeedSample
## W = 0.97671, p-value = 0.733
Lets interpret these results: For info, the Shapiro Wilk test is a
hypothesis test where the null hypothesis is that a random sample comes
from a normally distributed dataset.
The W statistic is the result of the Shapiro Wilk Test, which measures
how close the sample matches a normal distribution, with 1 being normal
and 0 being not normal. The P value determines if we can reject the null
hypothesis that the sample comes from a normally distributed dataset. If
the p value result is <= 0.05, we can reject the null hypothesis,
however if the p value result is > 0.05 we are unable to reject the
null hypothesis that the random sample comes from a normally distributed
dataset.
However, we encounter a problem. Depending on the random sample, the
samples either fail or pass the test. We want to see if it is passing
this test a reliable number of times. So lets build a function that does
the shapiro wilk test on the 100 different samples, and if 95% of them
are passing, we can reasonable say that they pass the shapiro wilk
test.
shapiroSample = function(dataset, reps, sampleSize){
pass_vec = c()
for(i in 1:reps){
sample = sample(dataset, sampleSize, replace = TRUE)
outcome = shapiro.test(sample)
# checks to see if this sample rejects the null hypothesis or not
# if it does adds a 0 for fail, and if not adds a 1 for pass
if(outcome[2] > 0.05){pass_vec[i] = 1}
else if(outcome[2] <= 0.05){pass_vec[i] = 0}
}
return(pass_vec)
}
Now lets execute this function for our data.
mean(shapiroSample(flyingSpeeds$base_stat, 100, 30))
## [1] 0.86
mean(shapiroSample(groundSpeeds$base_stat, 100, 30))
## [1] 0.69
Since the percentage of samples out of 100 that pass the Shapiro Wilk
test are nowhere close to 95%, or even 90%, for both flying and ground
types, we can say that our data is not normally distributed.
So what do we do now?
We need to find an appropriate statistical test that will give us the
findings we are looking for without normality of the data being a
requirement.
dbDisconnect(pokemonDatabase)